Distributed Machine Learning with PySpark



Chapter 8: Decision Tree Classification with Pandas, Scikit-Learn, and PySpark

In the preceding code, we first obtain the feature importances and their corresponding names, creating a list of tuples that pairs each feature with its importance score. Next, we sort this list in descending order based on importance scores, enabling us to identify the most influential features. Finally, we print the sorted feature importances.

The output indicates that the petal length and width of the flowers are the driving force behind the model’s predictions. In the PySpark model, the sepal length and width are redundant, as each has an importance score of 0.
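For reference, the ranking logic described above can be sketched as follows. This is an illustrative snippet rather than the book’s earlier listing; it assumes a trained PySpark DecisionTreeClassificationModel named model and a Python list feature_names holding the columns that were fed to the VectorAssembler.

# Importance score per input feature, as a NumPy array.
importances = model.featureImportances.toArray()

# Pair each feature name with its score.
feature_importance_pairs = list(zip(feature_names, importances))

# Sort the (name, score) pairs in descending order of importance.
sorted_importances = sorted(feature_importance_pairs, key=lambda pair: pair[1], reverse=True)

for name, score in sorted_importances:
    print(f"{name}: {score:.4f}")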

Step 5: Model evaluation

In this step, a custom function evaluate_model is defined. It takes the trained model and the test data as input. The function makes predictions using the model and evaluates its performance using the MulticlassClassificationEvaluator from PySpark. It returns the accuracy of the model and the predictions.
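A minimal sketch of such a helper is shown below. It assumes the test DataFrame uses the label column "label" and the model writes to the default "prediction" column; adjust the column names to match your pipeline.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

def evaluate_model(model, test_data):
    # Generate predictions for the held-out test DataFrame.
    predictions = model.transform(test_data)

    # Accuracy: the fraction of rows whose predicted class matches the label.
    evaluator = MulticlassClassificationEvaluator(
        labelCol="label", predictionCol="prediction", metricName="accuracy")
    accuracy = evaluator.evaluate(predictions)

    return accuracy, predictions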

After calling the functions in step 6, we get an output showing an accuracy of 94%. This indicates that the decision tree model is quite effective at making correct predictions on the test data: it correctly classifies the target variable for a large majority of instances, which is generally a good sign of the model’s predictive power.

It’s worth noting that the accuracy of the PySpark model is lower than that of Scikit-Learn, which was 100%. This difference arises because we are building models with default hyperparameters, and the two frameworks have distinct default settings. We will explore how to customize algorithm hyperparameters in Chapter 16.

Bringing It All Together

In the preceding sections of this chapter, we provided code that demonstrated each modeling step independently. However, in this section, our goal is to combine all the relevant code from those steps into a single code block. This allows data scientists to execute the code as a unified entity.

Scikit-Learn

[In]: from sklearn.datasets import load_iris

[In]: from sklearn.model_selection import train_test_split

[In]: from sklearn.tree import DecisionTreeClassifier

[In]: from sklearn.metrics import accuracy_score


[In]: def load_iris_dataset():
          """
          Load and return the Iris dataset features (X) and labels (y).

          Returns:
              X (numpy.ndarray): Feature matrix of the Iris dataset.
              y (numpy.ndarray): Target labels of the Iris dataset.
          """
          iris = load_iris()
          X = iris.data
          y = iris.target
          return X, y

[In]: def split_data(X, y, test_size=0.2, random_state=42):
          """
          Split data into training and test sets.

          Args:
              X (numpy.ndarray): Feature matrix of the dataset.
              y (numpy.ndarray): Target labels of the dataset.
              test_size (float): Proportion of data to include in the test set (default=0.2).
              random_state (int): Seed for random number generation (default=42).

          Returns:
              X_train (numpy.ndarray): Feature matrix of the training set.
              X_test (numpy.ndarray): Feature matrix of the test set.
              y_train (numpy.ndarray): Target labels of the training set.
              y_test (numpy.ndarray): Target labels of the test set.
          """
          X_train, X_test, y_train, y_test = train_test_split(
              X, y, test_size=test_size, random_state=random_state)
          return X_train, X_test, y_train, y_test

[In]: def build_decision_tree(X, y):
          """
          Build and return a decision tree classifier using the provided
          features (X) and labels (y).

          Args:
              X (numpy.ndarray): Feature matrix of the dataset.
              y (numpy.ndarray): Target labels of the dataset.

          Returns:
              clf (DecisionTreeClassifier): Trained decision tree classifier.
          """
          clf = DecisionTreeClassifier()
          clf.fit(X, y)
          return clf

[In]: def evaluate_model(clf, X_test, y_test):
          """
          Evaluate and return the accuracy score of the provided classifier
          (clf) on the test features (X_test) and labels (y_test).
          """
          # Body as described by the docstring: predict on X_test and
          # score the predictions against y_test with accuracy_score.
          y_pred = clf.predict(X_test)
          return accuracy_score(y_test, y_pred)
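To run the pipeline end to end, the helper functions can be chained as in the following illustrative sketch (not the book’s verbatim calling code):

X, y = load_iris_dataset()
X_train, X_test, y_train, y_test = split_data(X, y)
clf = build_decision_tree(X_train, y_train)
accuracy = evaluate_model(clf, X_test, y_test)
print(f"Accuracy: {accuracy:.2f}")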



This site does not store any files on its server. We only index and link to content provided by other sites. Please contact the content providers to delete copyright contents if any and email us, we'll remove relevant links or contents immediately.